Extreme Classification
Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products
In the last decade, it has been shown that many hard AI tasks, especially in NLP, can be naturally modeled as extreme classification problems, leading to improved precision. However, such models are prohibitively expensive to train due to the memory bottleneck in the last layer. For example, a reasonable softmax layer for the dataset of interest in this paper can easily reach well beyond 100 billion parameters (> 400 GB memory). To alleviate this problem, we present Merged-Average Classifiers via Hashing (MACH), a generic $K$-classification algorithm where memory provably scales as $O(\log K)$ without any assumption on the relation between classes. MACH is subtly a count-min sketch structure in disguise, which uses universal hashing to reduce classification with a large number of classes to a few embarrassingly parallel and independent classification tasks with a small (constant) number of classes.
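To make the construction above concrete, here is a minimal NumPy/scikit-learn sketch of the count-min-sketch view of MACH: R independent universal hashes fold the K classes into B buckets, one small B-class model is trained per hash, and decoding averages each class's bucket probability across repetitions. The sizes K, B, R and the logistic-regression base classifier are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

K, B, R = 1000, 32, 8  # K classes hashed into B buckets, R repetitions (illustrative sizes)
rng = np.random.default_rng(0)
hashes = rng.integers(0, B, size=(R, K))  # universal hashing: class -> bucket, one map per repetition

def train_mach(X, y):
    """Train R independent B-class models on hashed labels (embarrassingly parallel)."""
    return [LogisticRegression(max_iter=200).fit(X, hashes[r][y]) for r in range(R)]

def predict_mach(models, X):
    """Decode count-min style: average each class's bucket probability over repetitions."""
    n = X.shape[0]
    scores = np.zeros((n, K))
    for r, m in enumerate(models):
        probs = np.zeros((n, B))
        probs[:, m.classes_] = m.predict_proba(X)  # guard against buckets unseen in training
        scores += probs[:, hashes[r]]              # class k is scored via its bucket h_r(k)
    return scores.argmax(axis=1)
```

Because two distinct classes land in the same bucket under all R hashes with probability roughly (1/B)^R, the averaged scores can separate a class from its bucket-mates while storing only R models of B outputs each, i.e., memory logarithmic in K.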
Efficient Loss-Based Decoding on Graphs for Extreme Classification
In extreme classification problems, learning algorithms are required to map instances to labels from an extremely large label set. We build on a recent extreme classification framework with logarithmic time and space (LTLS), and on a general approach for error correcting output coding (ECOC) with loss-based decoding, and introduce a flexible and efficient approach accompanied by theoretical bounds. Our framework employs output codes induced by graphs, for which we show how to perform efficient loss-based decoding to potentially improve accuracy. In addition, our framework offers a tradeoff between accuracy, model size and prediction time. We show how to find the sweet spot of this tradeoff using only the training data. Our experimental study demonstrates the validity of our assumptions and claims, and shows that our method is competitive with state-of-the-art algorithms.
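A small sketch may help fix ideas on loss-based decoding: each class receives a ±1 codeword, one binary learner is trained per code bit, and prediction picks the class whose codeword incurs the lowest total loss over the learners' real-valued margins. The random code, exponential loss, and SGD base learners below are illustrative stand-ins; LTLS instead induces its codes from paths in a trellis graph.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

K, L = 50, 16                             # K classes, L code bits (illustrative sizes)
rng = np.random.default_rng(0)
code = rng.choice([-1, 1], size=(K, L))   # +/-1 output code, one codeword (row) per class

def train_ecoc(X, y):
    """One binary margin classifier per code bit."""
    return [SGDClassifier(loss="log_loss").fit(X, code[y, b]) for b in range(L)]

def decode(models, X, loss=lambda z: np.exp(-z)):
    """Loss-based decoding: choose the class whose codeword minimizes the
    summed loss over the L real-valued margins (exponential loss here)."""
    margins = np.column_stack([m.decision_function(X) for m in models])  # (n, L)
    # total loss of class k = sum_b loss(code[k, b] * margin[:, b])
    losses = loss(margins[:, None, :] * code[None, :, :]).sum(axis=2)    # (n, K)
    return losses.argmin(axis=1)
```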
Reviews: Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products
This paper studies the task of extreme classification with a large number of target categories. It develops a hashing-based algorithm, MACH, which trains a classifier for each hash mapping on the reduced problem with a much smaller number of target classes. The predictions of the sub-classifiers are then combined to reconstruct the final output. The proposed method is demonstrated to be both efficient and effective on multiple datasets.
Reviews: Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products
The paper presents a method for scaling up classifiers for tasks with an extremely large number of classes, with memory requirements scaling as O(log K) for K classes. The proposed model uses a count-min sketch to transform a very large classification problem into a small number of classification tasks, each with a fixed small number of classes. Each of these models can be trained independently and in parallel. Experimental results on a number of multi-class and multi-label classification tasks show that it either matches other, more resource-demanding approaches or outperforms them. The methodological contribution is significant, and it could serve as a baseline for future studies.
Navigating Extremes: Dynamic Sparsity in Large Output Space
Ullah, Nasib, Schultheis, Erik, Lasby, Mike, Ioannou, Yani, Babbar, Rohit
In recent years, Dynamic Sparse Training (DST) has emerged as an alternative to post-training pruning for generating efficient models. In principle, DST allows for a more memory-efficient training process, as it maintains sparsity throughout the entire training run. However, current DST implementations fail to capitalize on this in practice. Because sparse matrix multiplication is much less efficient than dense matrix multiplication on GPUs, most implementations simulate sparsity by masking weights. In this paper, we leverage recent advances in semi-structured sparse training to apply DST in the domain of classification with large output spaces, where memory efficiency is paramount. With a label space of possibly millions of candidates, the classification layer alone will consume several gigabytes of memory. Switching from a dense layer to a fixed fan-in sparse layer updated with sparse evolutionary training (SET), however, severely hampers training convergence, especially at the largest label spaces. We find that poor gradient flow from the sparse classifier to the dense text encoder makes it difficult to learn good input representations. By employing an intermediate layer or adding an auxiliary training objective, we recover most of the generalisation performance of the dense model. Overall, we demonstrate the applicability and practical benefits of DST in a challenging domain -- characterized by a highly skewed label distribution that differs substantially from typical DST benchmark datasets -- which enables end-to-end training with millions of labels on commodity hardware.
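As a rough illustration of the setup described in the abstract, the following PyTorch sketch simulates a fixed fan-in sparse classification layer by masking a dense weight matrix and applies a SET-style prune-and-regrow step. The layer sizes, fan-in, and pruning fraction are illustrative assumptions, and a real semi-structured implementation would replace the dense mask with sparse kernels.

```python
import torch
import torch.nn as nn

class FixedFanInSparseLinear(nn.Module):
    """Sparse classifier layer: each of the `out_features` label rows keeps
    exactly `fan_in` nonzero weights, simulated by masking a dense matrix."""
    def __init__(self, in_features, out_features, fan_in):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.fan_in = fan_in
        idx = torch.stack([torch.randperm(in_features)[:fan_in]
                           for _ in range(out_features)])
        mask = torch.zeros(out_features, in_features)
        mask.scatter_(1, idx, 1.0)
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Masking zeroes both the inactive weights and their gradients.
        return x @ (self.weight * self.mask).T

    @torch.no_grad()
    def set_update(self, prune_frac=0.3):
        """SET step: per row, drop the smallest-magnitude active weights
        and regrow the same number at random inactive positions."""
        w = (self.weight * self.mask).abs()
        k = int(prune_frac * self.fan_in)
        for row in range(w.shape[0]):
            active = self.mask[row].nonzero(as_tuple=True)[0]
            drop = active[w[row, active].argsort()[:k]]       # weakest active weights
            self.mask[row, drop] = 0.0
            inactive = (self.mask[row] == 0).nonzero(as_tuple=True)[0]
            grow = inactive[torch.randperm(len(inactive))[:k]]
            self.mask[row, grow] = 1.0
            self.weight[row, grow] = 0.0                      # regrown weights start at zero
```

Pruning and regrowing the same number of weights per row keeps the fan-in exactly constant, which is what makes the layout amenable to semi-structured kernels; the dense mask here is only a functional simulation of that sparsity.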